42 research outputs found

Detecting Sockpuppets in Deceptive Opinion Spam

Author: Chih-Chung Chang
DH Fusilier
E Stamatatos
M Koppel
N Graham
T Qian
Vladimir N. Vapnik
Xinxing Xu
Publication date: 09/03/2017

This paper explores the problem of sockpuppet detection in deceptive opinion spam using authorship attribution and verification approaches. Two methods are explored. The first is a feature subsampling scheme that uses the KL-Divergence on stylistic language models of an author to find discriminative features. The second is a transduction scheme, spy induction, that leverages the diversity of authors in the unlabeled test set by sending a set of spies (positive samples) from the training set to retrieve hidden samples in the unlabeled test set using nearest and farthest neighbors. Experiments using ground truth sockpuppet data show the effectiveness of the proposed schemes. Comment: 18 pages, Accepted at CICLing 2017, 18th International Conference on Intelligent Text Processing and Computational Linguistics.
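As a rough illustration of how KL-divergence can surface discriminative stylistic features, the sketch below ranks words by their contribution to KL(P || Q), where P is one author's unigram model and Q is a background model. The unigram features, the add-one smoothing, and the names `unigram_model`, `kl_contributions`, and `top_features` are assumptions made for this example; the abstract does not specify the paper's language models or its subsampling procedure.

```python
import math
from collections import Counter


def unigram_model(tokens, vocab, alpha=1.0):
    """Smoothed unigram distribution over a shared vocabulary (assumed add-alpha smoothing)."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_contributions(p, q):
    """Per-word terms p(w) * log(p(w) / q(w)); their sum is KL(P || Q)."""
    return {w: p[w] * math.log(p[w] / q[w]) for w in p}


def top_features(author_tokens, background_tokens, k=3):
    """Return the k words that contribute most to KL(author || background)."""
    vocab = set(author_tokens) | set(background_tokens)
    p = unigram_model(author_tokens, vocab)
    q = unigram_model(background_tokens, vocab)
    contrib = kl_contributions(p, q)
    return sorted(contrib, key=contrib.get, reverse=True)[:k]


if __name__ == "__main__":
    author = "really really really great the".split()
    background = "the the the great hotel".split()
    print(top_features(author, background, k=1))
```

Words with a large positive term are over-represented in the author's text relative to the background, which is one plausible reading of "discriminative" in this setting.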

arXiv.org e-Print Archive

Crossref

Near-optimal Linear Decision Trees for k-SUM and Related Problems

Author: Cardinal Jean
Ezra Esther
Gold Omer
Goto Eiichi
Pettie Seth
Vapnik Vladimir N
Vapnik VN
Williams Virginia Vassilevska
Publication venue: eScholarship, University of California
Publication date: 01/06/2019

We construct near-optimal linear decision trees for a variety of decision problems in combinatorics and discrete geometry. For example, for any constant k, we construct linear decision trees that solve the k-SUM problem on n elements using O(n log² n) linear queries. Moreover, the queries we use are comparison queries, which compare the sums of two k-subsets; when viewed as linear queries, comparison queries are 2k-sparse and have only {−1, 0, 1} coefficients. We give similar constructions for sorting sumsets A+B and for solving the SUBSET-SUM problem, both with optimal number of queries, up to poly-logarithmic terms. Our constructions are based on the notion of “inference dimension,” recently introduced by the authors in the context of active classification with comparison queries. This can be viewed as another contribution to the fruitful link between machine learning and discrete geometry, which goes back to the discovery of the VC dimension.
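To make the notion of a comparison query concrete, the sketch below writes one as a linear query: comparing the sums of two k-subsets A and B amounts to taking the sign of a dot product with a coefficient vector that is +1 on A, −1 on B, and 0 elsewhere. The function name `comparison_query` and the sample input are hypothetical, and the code shows only a single query, not the decision-tree construction from the paper.

```python
def comparison_query(x, subset_a, subset_b):
    """Compare sum(x[i] for i in A) with sum(x[j] for j in B) as a linear query.

    subset_a and subset_b are sets of distinct indices into x.
    Returns (coeffs, sign), where sign is +1, 0, or -1.
    """
    coeffs = [0] * len(x)
    for i in subset_a:
        coeffs[i] += 1
    for j in subset_b:
        coeffs[j] -= 1  # an index in both subsets cancels to 0
    value = sum(c * xi for c, xi in zip(coeffs, x))
    sign = (value > 0) - (value < 0)
    return coeffs, sign


if __name__ == "__main__":
    x = [5, -2, 7, 1, -3, 4]
    coeffs, sign = comparison_query(x, {0, 1}, {2, 4})  # k = 2
    print(coeffs, sign)
```

With k-subsets, at most 2k entries of the coefficient vector are non-zero, which is the sparsity property the abstract mentions.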

Crossref

eScholarship - University of California

MCRapper: Monte-Carlo Rademacher Averages for Poset Families and Approximate Pattern Mining

Author: Bartlett P. L.
De L.
Koltchinskii V.
Toivonen H.
Vapnik Vladimir N.
Publication venue: Association for Computing Machinery (ACM)
Publication date: 01/01/2020

We present MCRapper, an algorithm for efficient computation of Monte-Carlo Empirical Rademacher Averages (MCERA) for families of functions exhibiting poset (e.g., lattice) structure, such as those that arise in many pattern mining tasks. The MCERA allows us to compute upper bounds to the maximum deviation of sample means from their expectations, so it can be used to find both statistically significant functions (i.e., patterns) when the available data is seen as a sample from an unknown distribution, and approximations of collections of high-expectation functions (e.g., frequent patterns) when the available data is a small sample from a large dataset. This feature is a strong improvement over previously proposed solutions that could only achieve one of the two. MCRapper uses upper bounds to the discrepancy of the functions to efficiently explore and prune the search space, a technique borrowed from pattern mining itself. To show the practical use of MCRapper, we employ it to develop an algorithm TFP-R for the task of True Frequent Pattern (TFP) mining. TFP-R gives guarantees on the probability of including any false positives (precision) and exhibits higher statistical power (recall) than existing methods offering the same guarantees. We evaluate MCRapper and TFP-R and show that they outperform the state-of-the-art for their respective tasks.
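For readers unfamiliar with the quantity, the sketch below computes an n-trial MCERA by brute force for a small, explicitly enumerated family: for each trial it draws random signs in {−1, +1}, takes the largest sign-weighted sample mean over the family, and averages across trials. This is not MCRapper itself, which avoids full enumeration by exploiting the poset structure and pruning; the function name `mcera` and the toy itemset family are assumptions for illustration.

```python
import random


def mcera(functions, sample, n_trials=100, seed=0):
    """Brute-force n-trial Monte-Carlo Empirical Rademacher Average.

    functions: explicit finite list of callables f(x) -> float
    sample:    list of data points
    """
    rng = random.Random(seed)
    m = len(sample)
    # Evaluate every function on every sample point once.
    values = [[f(x) for x in sample] for f in functions]
    total = 0.0
    for _ in range(n_trials):
        sigma = [rng.choice((-1, 1)) for _ in range(m)]
        # Supremum over the family of the sign-weighted sample mean.
        total += max(sum(s * v for s, v in zip(sigma, row)) / m for row in values)
    return total / n_trials


if __name__ == "__main__":
    # Toy pattern-mining setting: each function indicates whether an
    # itemset is contained in a transaction.
    transactions = [frozenset("ab"), frozenset("abc"), frozenset("c"), frozenset("bc")]
    itemsets = [frozenset("a"), frozenset("b"), frozenset("ab"), frozenset("c")]
    family = [lambda t, p=p: float(p <= t) for p in itemsets]
    print(mcera(family, transactions, n_trials=200))
```

The cost here grows with the size of the family, which is exactly what makes enumeration impractical for real pattern languages.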

arXiv.org e-Print Archive

Crossref

Archivio istituzionale della ricerca - Università di Padova

Do Prices Coordinate Markets?

Author: Babaioff Moshe
Balcan Maria-Florina
Blum Avrim
Cesa-Bianchi Nicolò
Daniely Amit
Dhangwatnotai Peerapong
Elkind Edith
Littlestone Nick
Medina Andres Munoz
Morgenstern Jamie
Murota Kazuo
Oxley James G
Vapnik Vladimir N
Publication venue: Association for Computing Machinery (ACM)
Publication date: 22/06/2016

Walrasian equilibrium prices can be said to coordinate markets: They support a welfare optimal allocation in which each buyer is buying a bundle of goods that is individually most preferred. However, this clean story has two caveats. First, the prices alone are not sufficient to coordinate the market, and buyers may need to select among their most preferred bundles in a coordinated way to find a feasible allocation. Second, we don't in practice expect to encounter exact equilibrium prices tailored to the market, but instead only approximate prices, somehow encoding "distributional" information about the market. How well do prices work to coordinate markets when tie-breaking is not coordinated, and they encode only distributional information? We answer this question. First, we provide a genericity condition such that for buyers with Matroid Based Valuations, overdemand with respect to equilibrium prices is at most 1, independent of the supply of goods, even when tie-breaking is done in an uncoordinated fashion. Second, we provide learning-theoretic results that show that such prices are robust to changing the buyers in the market, so long as all buyers are sampled from the same (unknown) distribution.

arXiv.org e-Print Archive

Crossref

On the Power of Democratic Networks

Author: E. N. Mayoraz
Muroga Saburo
Nabutovsky Dimitry
Vapnik Vladimir
Zuev Y. A.
Publication venue: Society for Industrial & Applied Mathematics (SIAM)

Crossref

High-probability minimax probability machines

Author: AW Marshall
C Cortes
D Bertsimas
GRG Lanckriet
J Shawe-Taylor
John Shawe-Taylor
K Huang
M Marchand
N Alon
RA Fisher
Simon Cousins
V Vapnik
Vladimir Vapnik
Publication venue: Springer Science and Business Media LLC

Crossref

Large margin vs. large volume in transductive learning

Author: C. Bennett
D. Coppersmith
Dmitry Pechyony
F. Collobert
G. Forsythe
L. Lovasz
O. Bousquet
O. Chapelle
P. Derbeko
R. Horn
Ran El-Yaniv
S. Tong
V. N. Vapnik
Vladimir Vapnik
W. Gander
Publication venue: Springer Science and Business Media LLC

Crossref

Identifying Human Kinase-Specific Protein Phosphorylation Sites by Integrating Heterogeneous Information from Various Sources

Author: A Kreegipuu
AD Sharrocks
B Stillman
C-CCaC-J Lin
EJ Chang
F Diella
F Gnad
G Manning
JA Ubersax
JH Kim
LA Pinna
LM Iakoucheva
M Wagner
MB Yaffe
N Blom
Nanfang Xu
P Akamine
Pufeng Du
R Linding
S Yao
T Li
T Oelgeschlager
T Pawson
TH Dang
Tingting Li
V Andres
Vladimir N. Uversky
VN Vapnik
W Zachariae
Y Xue
Publication venue: Public Library of Science
Publication date: 15/11/2010

Phosphorylation is an important type of protein post-translational modification. Identification of possible phosphorylation sites of a protein is important for understanding its functions. Unbiased screening for phosphorylation sites by in vitro or in vivo experiments is time-consuming and expensive; in silico prediction can provide functional candidates and help narrow down the experimental efforts. Most of the existing prediction algorithms take only the polypeptide sequence around the phosphorylation sites into consideration. However, protein phosphorylation is a very complex biological process in vivo. The polypeptide sequences around the potential sites are not sufficient to determine the phosphorylation status of those residues. In the current work, we integrated various data sources such as protein functional domains, protein subcellular location and protein-protein interactions, along with the polypeptide sequences to predict protein phosphorylation sites. The heterogeneous information significantly boosted the prediction accuracy for some kinase families. To demonstrate potential application of our method, we scanned a set of human proteins and predicted putative phosphorylation sites for Cyclin-dependent kinases, Casein kinase 2, Glycogen synthase kinase 3, Mitogen-activated protein kinases, protein kinase A, and protein kinase C families (available at http://cmbi.bjmu.edu.cn/huphospho). The predicted phosphorylation sites can serve as candidates for further experimental validation. Our strategy may also be applicable for the in silico identification of other post-translational modification substrates.

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Subjectivity in Inductive Inference

Author: Andrei N Kolmogorov
Bertrand Russell
C Jeffrey
Christopher S Wallace
Daniel Kahneman
Gabrielle Gayer
Gideon Schwarz
Gregory J Chaitin
Herbert Simon
Hirotugu Akaike
Itzhak Gilboa
J Ray
John E Hopcraft
Jorma Rissanen
Larry Samuelson
Ludwig Wittgenstein
N Vladimir
Nelson Goodman
Thomas S Kuhn
Vladimir N Vapnik
Publication venue: Elsevier BV
Publication date: 01/01/2009

Crossref

Factors Influencing the Statistical Power of Complex Data Analysis Protocols for Molecular Signature Development from Microarray Data

Author: A Bhattacharjee
A Butte
A Dupuy
A Potti
A Rosenwald
A Statnikov
Alexander Statnikov
AM Glas
B Freidlin
Bryan E. Shepherd
CF Aliferis
Constantin F. Aliferis
CX Ling
DG Beer
DJ Hand
EJ Yeoh
EL Lehmann
FE Harrell Jr
Frank E. Harrell
G Casella
Ioannis Tsamardinos
JA Sparano
Jonathan S. Schildcrout
JP Ioannidis
KK Dobbin
L Ein-Dor
L Shi
LA Habel
LJ van't Veer
M Saerens
MD Radmacher
ME Burczynski
MJ Marton
ML Lee
N Iizuka
P Baldi
PI Good
R Kohavi
R Simon
RE Fan
S Michiels
S Mukherjee
S Paik
S Ramaswamy
SL Pomeroy
T Bammler
T Hastie
TR Golub
TS Furey
UM Braga-Neto
Vladimir B. Bajic
VN Vapnik
W Jiang
Publication venue: Public Library of Science
Publication date: 17/03/2009

Critical to the development of molecular signatures from microarray and other high-throughput data is testing the statistical significance of the produced signature in order to ensure its statistical reproducibility. While current best practices emphasize sufficiently powered univariate tests of differential expression, little is known about the factors that affect the statistical power of complex multivariate analysis protocols for high-dimensional molecular signature development. We show that choices of specific components of the analysis (i.e., error metric, classifier, error estimator and event balancing) have large and compounding effects on statistical power. The effects are demonstrated empirically by an analysis of 7 of the largest microarray cancer outcome prediction datasets and supplementary simulations, and by contrasting them to prior analyses of the same data. The findings of the present study have two important practical implications. First, high-throughput studies, by avoiding under-powered data analysis protocols, can achieve substantial economies in the sample required to demonstrate statistical significance of predictive signal. Factors that affect power are identified and studied. Much less sample than previously thought may be sufficient for exploratory studies as long as these factors are taken into consideration when designing and executing the analysis. Second, previous highly-cited claims that microarray assays may not be able to predict disease outcomes better than chance are shown by our experiments to be due to under-powered data analysis combined with inappropriate statistical tests.

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central